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Abstract  We  introduce  a  preconditioner  that  can  be  both  constructed  and  applied  us¬ 
ing  only  the  ability  to  apply  the  underlying  operator.  Such  a  preconditioner  can  be  very 
attractive  in  scenarios  where  one  has  a  highly  efficient  parallel  code  for  applying  the  oper¬ 
ator.  Our  method  constructs  a  polynomial  preconditioner  using  a  nonlinear  least  squares 
(NLLS)  algorithm.  We  show  that  this  polynomial-based  NLLS-optimized  (PBNO)  pre¬ 
conditioner  significantly  improves  the  performance  of  a  discontinuous  Galerkin  (DG) 
compressible  Euler  equation  model  when  run  in  an  implicit-explicit  time  integration 
mode.  The  PBNO  preconditioner  achieves  significant  reduction  in  GMRES  iteration 
counts  and  model  wall-clock  time,  and  significantly  outperforms  several  existing  types 
of  generalized  (linear)  least  squares  (GLS)  polynomial  preconditioners.  Comparisons  of 
the  ability  of  the  PBNO  preconditioner  to  improve  DG  model  performance  when  employ¬ 
ing  the  Stabilized  Biconjugate  Gradient  algorithm  (BICGS)  and  the  basic  Richardson 
(RICH)  iteration  are  also  included.  In  particular,  we  show  that  higher  order  PBNO 
preconditioning  of  the  Richardson  iteration  (run  in  a  dot  product  free  mode)  makes  the 
algorithm  competitive  with  GMRES  and  BICGS  in  a  serial  computing  environment.  Be¬ 
cause  the  NLLS-based  algorithm  used  to  construct  the  PBNO  preconditioner  can  handle 
both  positive  definite  and  complex  spectra  without  any  need  for  algorithm  modification, 
we  suggest  that  the  PBNO  preconditioner  is,  for  certain  types  of  problems,  an  attractive 
alternative  to  existing  polynomial  preconditioners  based  on  linear  least-squares  methods. 
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1  Introduction 

1.1  Background 

In  a  variety  of  practical  applications  one  must  repeatedly  solve  a  large  system  of  linear 
equations  where  one  has  an  extremely  fast  parallel  code  for  applying  an  underlying  fixed 
linear  operator.  Coincident  with  this,  there  are  many  iterative  methods  (e.g.  GMRES) 
that  can  leverage  this  ability  as  they  rely  only  on  repeated  applications  of  the  opera¬ 
tor  and  hence  their  speed  is  directly  related  to  the  speed  with  which  the  operator  can 
be  applied.  Unfortunately,  many  of  these  methods  converge  very  slowly  without  proper 
preconditioning  which  can  be  problematic.  Although  common  approaches  to  precondi¬ 
tioning  such  as  incomplete  LU  can  be  very  effective  in  reducing  the  number  of  iterations, 
they  are  not,  generally,  able  to  exploit  the  same  parallel  structure  as  that  of  the  original 
operator.  This  makes  their  use  very  troublesome  because  they  may  only  be  practical  if 
one  can  construct  a  sufficiently  fast  code  for  applying  them.  Such  a  code  may  not  be 
feasible  for  a  variety  of  reasons  (cost  of  development,  parallel  structure  that  is  different 
from  that  of  the  original  operator,  etc.). 

Fortunately,  polynomial  preconditioners  do  not  suffer  from  these  drawbacks  as  they 
can  directly  leverage  the  existing  parallel  code  for  the  operator  and  hence  are  excellent 
candidates  for  problems  of  this  type.  With  this  in  mind,  our  goal  is  to  develop  a  method 
for  constructing  an  effective  polynomial  preconditioner  that  requires  only  the  existing 
code  for  applying  the  original  operator,  from  construction  to  application.  We  will  de¬ 
rive  and  then  demonstrate  the  effectiveness  of  this  approach  on  a  specific  test  problem 
involving  a  rising  thermal  bubble  (RTB).  We  considered  this  problem  previously  in  [3] 
using  an  implicit-explicit  (IMEX)  2-D  continuous  Galerkin  (CG)  model  of  the  Euler 
equations  in  Schur-complement  form  with  a  continuous  initial  temperature  distribution. 
Here  we  revisit  the  problem  using  a  2-D  DG  model  (without  Schur-complement  form1) 
and  both  continuous  and  discontinuous  initial  temperature  distributions.  We  choose  to 
use  the  RTB  test  case  in  particular  for  our  simulations  because  this  test  case  (from  all 
test  cases  typically  used  as  given  in  [6])  offers  the  largest  possible  gains  in  efficiency  for 
an  IMEX  method  since  the  stiffness  of  the  problem  derives  only  from  the  fast  acoustic 
waves  in  the  compressible  equations. 


1.2  Paper  Format 

Section  2  provides  an  overview  of  the  test  model  context  within  which  we  will  develop 
the  preconditioner.  In  particular,  we  begin  with  the  assumption  that  there  is  an  exist¬ 
ing  code  for  applying  the  underlying  operator  (e.g.  a  fast  parallel  code,  a  configurable 
hardware  implementation)  and  that  we  will  therefore  be  using  a  solution  algorithm  that 
only  requires  the  ability  to  apply  the  underlying  operator  (e.g.  GMRES).  Given  these 
preliminary  assumptions  we  wish  to  create  a  preconditioner  while  observing  three  basic 
premises: 

—  Premise  1:  It  must  be  possible  to  apply  the  preconditioner  using  only  the  existing 
matrix-free  code. 

—  Premise  2:  It  must  be  possible  to  construct  the  preconditioner  using  only  the  ex¬ 
isting  code. 

1  In  [13]  it  was  shown  that  deriving  the  DG  Schur-complement  form  for  general  boundary  conditions 
remains  an  open  problem. 
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—  Premise  3:  The  system  in  question  will  be  solved  many  times  and  hence  the  cost  of 
constructing  the  preconditioner  will  be  amortized  over  a  great  number  of  uses. 

This  section  also  includes  an  overview  of  the  iterative  solvers  that  we  will  evaluate 
in  terms  of  their  response  to  the  PBNO  preconditioner  and  several  comparison  precon¬ 
ditioners.  Section  3  provides  some  background  on  polynomial  preconditioning  including 
existing  generalized  least  squares  (GLS)  preconditioners  to  which  the  PBNO  precon¬ 
ditioner  will  be  compared.  Section  4  lays  out  the  mathematical  development  of  the 
NLLS-based  method  for  constructing  the  PBNO  preconditioner,  and  explains  how  the 
preconditioner  can  be  applied  in  a  completely  matrix-free  manner.  Section  4  describes 
how  the  comparision  GLS  preconditioners  are  constructed.  Section  5  presents  the  results 
of  the  PBNO  preconditioner  performance  in  the  DG  model  of  the  2-D  RTB  problem  in 
a  serial  computing  environment.  Section  6  summarizes  the  key  results  of  the  paper. 


2  Test  Model  Context 

The  test  model  context  for  our  development  is  a  high-resolution  nonhydrostatic  atmo¬ 
spheric  model  (NUMA)  as  described  in  [6, 7, 9, 8].  The  combination  of  high-resolution 
and  large  domain  size  associated  with  global  atmospheric  models  requires  as  many  as 
O(107)  elements  and  Ng  =  0(10°)  grid  points  (nodes).  As  a  result,  at  each  time-step  in 
the  IMEX  time  integration  process  there  is  a  need  to  iteratively  solve  a  very  large,  but 
sparse,  linear  system  of  the  form 

Aqn+1=R(qn),  (1) 

or,  using  standard  generic  notation 

Ax  =  b.  (2) 

In  Eq.  (1),  q  is  the  state  vector,  R  is  a  right-hand  side  operator,  and  the  matrix  A  is 
square,  invertible,  nonsymmetric,  and  fixed  (unless  adaptive  mesh  refinement  is  used). 
The  size  of  A  is  4 Ng  x  4 Ng  in  2-D  and  5Ng  x  5Ng  in  3-D  (both  are  non-Schur  complement 
forms). 

At  this  point  it  is  important  to  emphasize  that  system  matrix  A  in  Eq.  (1)  is  never 
constructed  in  either  a  global  or  elemental  form.  Rather,  the  algorithms  in  NUMA 
merely  accomplish  the  equivalent  action  of  matrix  A  on  state  vector  q.  Thus,  from  an 
implementation  perspective,  Eq.  (2)  is  effectively 

La{x)  =  b.  (3) 

The  above  fact  provides  the  motivation  behind  Premises  1  and  2  in  Section  1.2  above. 

To  facilitate  the  subsequent  analysis  and  comparison  of  results,  it  is  useful  to  define 
the  scaling  parameter 

\  _  |  Amax  |  +  |  ^  min  |  /  a\ 

Amid  —  ^  5 

where  Amax  and  A mjn  are  the  eigenvalues  of  system  matrix  A  with  the  largest  and 
smallest  moduli,  and  which  can  be  well-approximated  via  the  Arnoldi  method  for  a 
system  of  any  size.  Dividing  both  sides  of  Eq.  2  by  Amid  results  in  the  equivalent  system 

{A/ Ami(i)x  =  ( b/Amid)i  (5) 

which  we  will  hereafter  denote  more  concisely  by 


Ax  =  b. 


(6) 
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Fig.  1  Example  of  a  complex  spectrum  for  a  matrix  A  associated  with  a  2-D  DG  model  of  the  RTB  test 
case.  The  model  had  a  coarse  spatial  resolution  consisting  of  5-by-5  elements  and  5th  order  Lagrange 
polynomials,  based  on  the  Legendre- Gauss-Lobat to  (LGL)  points,  and  employed  a  2-stage  ARK  time- 
differencing  scheme  with  the  time-step  set  to  produce  a  Courant  No.  of  16.  The  eigenvalues  of  the 
unsealed  system  matrix  A  having  the  largest  and  smallest  modulus,  respectively,  appear  in  the  upper 
left. 


For  a  DG  model  of  the  RTB  problem  (for  which  a  Schur-complement  form  is  not  presently 
available),  the  spectrum  of  A  is  complex,  confined  to  the  right  half  of  the  complex  plane, 
and  largely  contained  within  the  unit  disk  as  shown  in  Fig.  1. 

Despite  the  need  to  solve  Eq.  (6)  iteratively,  the  attraction  of  IMEX  methods  is 
that,  under  suitable  dynamical  circumstances,  one  may  employ  time-steps  that  are  as 
much  as  100  times  (or  more)  greater  than  the  maximum  allowable  explicit  time-step. 
As  a  result,  IMEX-based  models  can  run  faster  than  explicit  models  provided  that  the 
number  of  iterations  needed  to  solve  Eq.  (6)  is  kept  under  control  by  a  sufficiently 
effective  preconditioner.  The  RTB  test  case  is  an  example  of  a  dynamical  scenario  that 
can  be  run  at  a  large  Courant  Number  in  an  IMEX  model,  and  thus  permits  a  large 
time-step.  In  [3]  we  described  this  test  case  and  used  it  to  show  the  ability  of  the 
EBSO  preconditioner  to  accelerate  the  iterative  solution  of  Eq.  (6)  when  A  is  in  Schur- 
complement  form.  Herein  we  will  use  the  same  test  case  to  illustrate  the  ability  of  the 
PBNO  preconditioner  to  solve  Eq.  (6)  regardless  of  whether  A  is  in  Schur-complement 
form  or  not,  and  thus  allow  us  to  use  it  to  precondition  DG  compressible  flow  models. 

Because  the  matrix  A  is  not  symmetric  positive  definite  (SPD)  our  selection  of  suit¬ 
able  iterative  solvers  is  limited  to  those  that  can  handle  non-SPD  systems.  In  [3]  we 
compared  the  performance  of  GMRES  [17]  based  on  the  Arnoldi  process  [16,  p.  165]  and 
transpose-free  BICGS  [19]  based  on  the  Lanczos  process  [16,  p.  234].  In  this  paper  we  also 
include  the  (Richardson)  RICH  algorithm  [19,  p.  22].  All  three  of  these  algorithms  will 
be  evaluated  on  the  basis  of  their  PBNO-preconditioned  performance  in  a  serial  comput¬ 
ing  environment.  The  rationale  in  selecting  these  three  solvers  is  that  they  demonstrate 
diverse  combinations  of  computational  and  communication  costs.  For  example,  GMRES 
applies  the  system  matrix  once  per  iteration,  but  if  n  is  the  number  of  iterations  required 
for  convergence,  then  the  number  of  dot-products  performed  by  GMRES  is  n2 / 2.  By 
contrast,  transpose-free  BICGS  applies  the  system  matrix  twice,  but  only  requires  6 n 
dot-products.  Like  GMRES,  RICH  applies  the  system  matrix  once  per  iteration.  How¬ 
ever,  unlike  either  GMRES  or  BICGS,  RICH  can  be  run  in  a  dot-product-free  mode, 
which  can  be  potentially  advantageous  in  a  massively-parallel  computing  environment. 
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If  a  linear  system  is  to  be  solved  only  once,  then  the  cost  to  construct  a  precondi¬ 
tioner  prior  to  its  use  in  the  solution  process  must  be  kept  small.  However,  in  many 
IMEX  applications  matrix  A  in  Eq.  (6)  remains  the  same  for  hundreds  or  thousands 
of  time-steps.2  Moreover,  in  a  typical  operational  numerical  weather  prediction  model 
only  the  initial  condition  specification  relative  to  the  basic  state  changes,  which  changes 
only  the  right  hand  side  of  Eq.  (6).  As  a  result,  matrix  A  can  remain  unchanged  for 
many  thousands  of  model  runs.  It  is  a  critical  observation  that  in  such  situations  ( i.e ., 
solving  the  same  system  with  different  right  hand  sides  many  hundreds  of  thousands  of 
times),  the  cost  to  construct  the  preconditioner  is  essentially  unimportant  since  it  will 
be  amortized  over  a  great  number  of  applications.  As  a  result,  it  becomes  reasonable 
to  construct  the  preconditioner  using  computationally  more  expensive  methods.  Hence, 
the  test  model  context  we  are  considering  clearly  satisfies  Premise  3.  In  [3]  we  showed 
how  a  NLLS  algorithm  could  be  used  to  construct  the  EBSO  preconditioner  in  a  process 
that  required  approximately  an  hour  of  computing  time  on  a  typical  desktop  computer. 
In  this  paper  we  show  that  a  similar  NLLS  algorithm  can  construct  the  PBNO  precon¬ 
ditioner  so  efficiently  that  even  when  matrix  A  is  not  in  Schur-complement  form  the 
construction  cost  of  the  preconditioner  is  completely  amortized  after  only  a  tiny  frac¬ 
tion  of  the  total  number  of  time-steps  in  a  typical  integration  of  the  RTB  problem  (see 
Section  5.4). 


3  Polynomial  Preconditioning  Background 

Generally  speaking,  preconditioning  methods  may  be  grouped  into  two  classes  that  have 
been  termed  implicit  and  explicit  [2],  where  the  latter  is  also  referred  to  as  sparse  ap¬ 
proximate  inverse  preconditioning  (See  [1]  for  a  thorough  discussion  of  the  two  classes). 
When  explicit  preconditioning  is  employed,  Eq.(6)  is  replaced  by  the  equivalent  system 
(here  using  left-preconditioning) 

(KA)x  =  Kb,  (7) 

where  the  matrix  I\  is  required  to  approximate  A-1  in  some  sense. 

In  polynomial  preconditioning  (see  [12]  for  a  thorough  discussion)  we  require  K  to 
be  a  low-order  polynomial  in  A 

m 

K  =  s(A)=^klAi.  (8) 

i= 0 

From  a  DG  modeling  perspective,  an  extremely  attractive  feature  of  making  K  a  poly¬ 
nomial  in  A  is  that  K  possesses  the  same  degree  of  parallelization  potential  as  A.  Thus, 
incorporating  a  polynomial-based  K  into  a  parallel  version  of  DG  NUMA  would  require 
no  additional  preconditioner-specific  parallelization  machinery  whatsoever.  Polynomial 
preconditioning  in  this  form  clearly  satisfies  Premise  1. 

Now  since  any  polynomial  in  A  shares  the  same  invariant  subspaces  that  are  common 
to  A  and  A _1,  the  requirement  that  K  approximate  A-1  is  effectively  the  requirement 
that 


a(K)  w  cr(A_1). 

(9) 

or  alternatively,  the  requirement  that 

a(KA)  «  a (/). 

(10) 

2  The  reason  the  matrix  remains  constant  with  time  is  due  to  some  judicious  choices  we  make.  First, 
we  linearize  the  nonlinear  system  about  a  spatially  dependent,  but  time-independent  basic  state.  Next, 
in  our  IMEX-RK  approach  we  only  use  SDIRK  methods  which,  in  the  Butcher  tableau,  have  constant 
diagonal  terms  (see  [8]). 
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The  simplest  polynomial  preconditioner  is  the  Neumann  preconditioner 

m 

sN(A)  =  Y/(I-A)i,  (11) 

i= 0 

which  is  based  on  the  Neumann  series  [4]  property  that  the  right-hand  side  of  Eq.  (11) 
must  converge  to  A^1  as  m  — >  oo  as  long  as  the  spectrum  of  A  is  contained  within 
the  open  unit  disk  centered  at  (1,0).  Neumann  preconditioners  contain  no  user-specified 
parameters.  Thus,  in  situations  where  they  can  be  applied  successfully,  they  basically 
function  as  a  zero-skill  benchmark  against  which  other  polynomial  preconditioners  may 
be  compared.  However,  the  spectrum  of  the  rising  bubble  problem  shown  in  Fig.  1  has 
many  small  eigenvalues  very  close  to  the  unit  circle,  and  we  have  verified  that  the  2-D 
IMEX  DG  NUMA  model  of  the  RTB  problem  will  not  reliably  converge  at  each  time-step 
using  a  Neumann-type  preconditioner. 

A  linear  least  squares  approach  to  formally  satisfying  condition  (10)  would  be  the 
optimization  problem 

kopt  <=  min  ||1  -  As(A))||2,  (12) 

A  Gcr(A) 

where  vector  kopt  =  [fco  . . .  km]T  contains  the  optimal  coefficients  of  the  polynomial  in 
Eq.  (8).  However,  due  to  the  intractability  of  (12),  it  has  been  customary  (based  on  the 
maximum  modulus  principle)  to  replace  (12)  by 

kopt  min  1 1 1  As(A)) | |w,  (13) 


where 

—  r  is  a  continuous  convex  contour  that  encloses  the  spectrum  of  a  Hessenberg  matrix3 

H  that  arises  from  applying  the  Arnoldi  process  to  A, 

—  w( A)  is  a  positive  weighting  function, 

—  and  the  norm  involved  is  induced  by  the  inner  product 

(f,ff)  =  J^fWffWu>(A)d\A\.  (14) 

Although  the  so-called  generalized  least-squares  (GLS)  preconditioners  that  result 
from  optimization  problem  (13)  clearly  can  be  constructed  for  systems  with  complex 
spectra,  often  this  approach  is  restricted  to  symmetric  positive  definite  (SPD)  or  sym¬ 
metric  indefinite  (SID)  systems  that  simplify  the  preconditioner  construction  process  by 
virtue  of  their  real  spectra  [14,11,12].  Moreover,  the  choice  of  the  functional  form  of  w 
is  typically  influenced  not  so  much  by  maximization  of  preconditioner  performance,  but 
more  by  the  desire  to  obtain  kopt  via  analytical  means,  such  as  those  made  possible  by, 
for  instance,  choosing  w  to  be  an  appropriately  translated/scaled  Chebyshev  weight  (see 
e.g.,  [16,  p.  385]). 

Past  examples  of  GLS  preconditioners  for  systems  with  a  complex  spectra  again 
tend  to  choose  w  for  analytical  convenience  (see  e.g.,  [15,  p.  159]  and  [16,  p.  387]). 
Furthermore,  an  additional  issue  that  arises  when  constructing  a  GLS  preconditioner 
for  a  system  with  a  complex  spectrum  is  that  a  degree  of  user-determined  arbitrariness 
is  introduced  with  regard  to  what  convex  form  (e.g.,  an  ellipse  or  polygon)  is  to  be  used 
to  enclose  the  system’s  spectrum  (see  e.g.,  [15,  p.  158]).  Figure  2  illustrates  how  the 
hull  of  the  spectrum  of  the  same  A  as  shown  in  Fig.  1  is  accurately  represented  by  the 
spectrum  of  the  Hessenberg  matrix  H  constructed  from  A,  and  also  shows  an  example  of 


3  See  Section  4.1  for  the  details  of  how  H  is  constructed. 
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Fig.  2  As  in  Fig.  1,  except  that  included  is:  i)  the  spectrum  (green  circles)  of  a  150  x  150  Hessenberg 
matrix  H  obtained  from  A  via  the  Arnoldi  process,  and  ii)  an  octagonal  example  of  contour  r  (solid 
blue  line)  constructed  so  as  to  enclose  the  spectral  hull  of  H .  For  comparison  purposes  with  H,  the 
dimension  of  A  is  3600  x  3600. 


an  octagonal  contour  r  constructed  from  knowledge  of  the  spectrum  of  H  only.  We  will 
use  this  octagonal  r  when  we  construct  several  GLS  preconditioners  (see  Section  4.4) 
to  compare  with  the  PBNO  preconditioner  in  terms  of  performance  in  the  DG  NUMA 
model. 


4  PBNO  and  Comparison  Preconditioner  Development 

4.1  PBNO  Optimization  Problem  Formulation 


In  the  nonlinear  least  squares  approach  we  develop  here,  we  begin  formally  by  replacing 
optimization  problem  (12)  with 

kopt<=  min  ||1  -  As(A))||p,  (15) 

AGct(A) 


where  the  subscript  p  is  the  index  of  a  standard  Euclidean  p-norm.  If  matrix  A  is  M  x  M . 
and  we  choose  a  power  basis  form  for  K  as  shown  in  Eq.  (8),  then  optimization  problem 
(15)  effectively  becomes 


kopt  <=  min 


£|i-«(W 


(16) 


Because  of  the  poor  behavior  of  the  power  basis  when  doing  numerical  computations, 
we  will  instead  represent  coefficients  of  preconditioner  K  using  the  form 


where 


ra+1 

slW  =  £  g£j(a), 

i—1 


771+ 1 

+(A)= n 


(A  -  n.j) 

(nj  —  ni)  ’ 


(17) 


(18) 
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are  mth- order  Lagrange  polynomials  with  nodal  values  Hi  located  at  the  Chebysliev 
points  appropriately  translated  and  scaled  so  as  to  be  centered  in  the  interval  [|  Amin  l.|A  max  |  ]  • 
By  virtue  of  representing  K  via  Eq.  (17),  optimization  problem  (16)  is  replaced  by 


opt 


'  M 

E 

.i= 1 


1  —  sz,(A»)Aj| 


i/p 


(19) 


where  vector  copt  =  [ci . . .  cm+i]T  contains  the  optimal  coefficients  of  the  polynomial  in 
Eq.  (17). 

In  principle,  optimization  problem  (19)  provides  us  with  a  method  to  construct 
matrix  K  by  finding  the  coefficients  in  Eq.  (17),  but  it  is  likely  just  as  intractable  as 
optimization  problem  (12).  Therefore,  in  order  to  advance  the  construction  we  replace 
the  nonlinear  least  squares  problem  (19)  with  a  tractable  proxy  via  the  procedure  that 
follows. 

Recall  that  the  Arnoldi  algorithm  applied  to  an  arbitrarily  large  M-by-M  matrix  A 
generates  the  factorization 

AQ  =  QH ,  (20) 

where  H  is  an  upper  Hessenberg  matrix  of  size  N-by-N  and  N  is  relatively  small  (N  = 
150  in  this  paper).  This  upper  Hessenberg  matrix  has  a  spectrum 

&(H)  =  [hi  ■  ■  -Hn]T  ,  (21) 


that  can  be  quickly  computed  in  its  entirety  using  the  QR  algorithm  [18,  p.  211-224]. 

A  key  feature  of  cr(H)  is  that  it  provides  a  good  discrete  approximation  to  the  hull 
of  the  spectrum  of  the  matrix  A  as  long  as  N  is  adequately  large.  This  property  of  cr(H) 
is  illustrated  in  Fig.  2  for  the  scaled  system  matrix  A  associated  with  the  RTB  problem. 

The  combination  of  the  fact  that  cr(H)  contains  the  eigenvalues  of  H  farthest  from 
(1,  0)  on  the  complex  plane,  and  the  fact  that  the  residual  polynomial 


r( A)  =  1  -  sL{ A)A, 


(22) 


contained  inside  optimization  problem  (19)  tends  to  become  large  for  those  eigenvalues 
farthest  from  (1, 0),  suggests  that  we  replace  (19)  with  the  very  tractable  proxy  problem 


-’Opt 


‘  N 

E 


1  -  sL{ni)ni \ 


i  /p 


(23) 


We  note  that  we  do  not  need  in-depth  information  about  the  operator  in  order  to 
proceed.  In  order  to  construct  the  proxy  problem  we  require  only  the  ability  to  apply 
the  underlying  operator  and  therefore  this  approach  satisfies  Premise  2. 

In  concluding  this  subsection  we  note  that  if  cr(H)  is  complex,  then  optimization 
problem  (23)  must  be  solved  using  a  nonlinear  least  squares  (NLLS)  approach  for  all 
values  of  norm  index  p  >  1.  If  cr(H)  is  real,  then  a  NLLS  approach  must  be  used  for  all 
values  of  p  >  2. 


4.2  PBNO  Optimization  Problem  Solution 

We  can  iteratively  solve  optimization  problem  (23)  for  any  even  value  of  norm  index 
p  >  2  using  the  Gauss-Newton  (GN)  algorithm  as  follows: 
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1.  Create  the  vector- valued  cost  function 


h{c) 


|1  -  sL{n i)Mi|p/2  ,  -  -  - ,  |1  —  sl((jln)/j, N\p/2 


(24) 


where  vector  c  =  [ci . . .  cm+i]T  contains  the  coefficients  q  in  Eq.  (17)  and  the  expo¬ 
nent  form  p/2  allows  for  the  fact  that  the  GN  algorithm  minimizes  the  square  of  the 
norm  of  the  residual  vector. 

2.  Approximate  h(c)  as  a  lst-0rcler  Taylor  series 


h(c)  =  h(c0 )  +  J Ac, 


(25) 


where  c0  is  an  appropriate  lst-guess4  for  c,  and  J  is  a  finite-difference  approximated 
Jacobian  matrix  given  by 


J 


dh  dh  dh 

dci  'dc-A  dcm+ 1 


J  C— C0 


(26) 


3.  Obtain  the  QR  factorization  of  the  Jacobian  J  =  QR  and  apply  the  minimum 
residual  criterion  QTh(c)  =  0  to  Eq.  (25)  to  obtain  the  normal  system 


RAc  =  —QTh(c0), 


(27) 


which  is  then  solved  for  Ac  by  back  substitution. 

4.  If  necessary,  reduce  Ac  by  repeated  factors  of  1/2  until  the  descent  criterion 


\\h{c0  +  Ac)\\2  <  ||h(c0)||2 


(28) 


is  satisfied. 

5.  Replace  c0  by  c0  +  Ac  and  repeat  steps  2-4  until  the  relative  error  criterion 


MM 

IIM 


(29) 


is  satisfied. 

6.  Convert  the  copt  obtained  from  Steps  1-5  into  the  equivalent  power  basis  coefficient 
vector  kopt  so  that  Eq.  (8)  can  be  used  when  actually  applying  the  PBNO  precondi¬ 
tioner  (via  Horner’s  Method). 


In  all  cases  we  found  that  letting  e  =  10-5  in  GN  convergence  criterion  (29)  was  sufficient 
to  ensure  the  GN  algorithm  produced  a  solution  copt  in  optimization  problem  (23)  that 
was  of  adequate  precision  for  use  in  Eq.  (17).  The  number  of  GN  iterations  required 
to  satisfy  criterion  (29)  depends  on  the  order  m  of  the  PBNO  preconditioner  and  the 
norm  index  p1  varying  from  fewer  than  10  for  m  =  1  and  p  =  2,  to  on  the  order  of  50 
for  m  =  9  and  p  =  20. 5 

Fig.  3(a)-(b)  shows  the  effect  of  5t,l-order  PBNO  preconditioners  with  p  =  2  and 
p  =  20  on  the  spectrum  of  the  preconditioned  system  matrices  KA  and  KH  for  the 
RTB.  The  coefficients  of  these  four  PBNO  preconditioners  appear  in  Table  1  of  this 
section.  When  compared  to  the  spectra  unpreconditioned  system  matrices  A  and  H  in 
Fig.  2,  we  can  see  that  in  each  panel  of  Fig.  3  the  effect  of  the  PBNO  preconditioner  is 
to  drive  all  the  eigenvalues  of  K A  and  KH  closer  to  (1,0)  on  the  complex  plane  in  an 
effort  to  satisfy  criterion  (10).  Most  importantly,  notice  in  both  panels  of  Fig.  3  that  the 

4  For  all  cases  in  this  paper  we  use  cq  =  [1 . . .  1]T,  which,  by  virtue  of  the  form  of  Eq.  (17),  makes 
sl(  A)  =  1. 

5  The  range  of  iteration  values  just  cited  excludes  the  case  when  p  =  2  and  the  spectrum  of  A  is 
positive  definite,  in  which  case  optimization  problem  (23)  is  linear  and  GN  converges  in  a  single  step. 
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a)  b) 

Fig.  3  (a)  Spectra  for  the  preconditioned  system  matrices  KA  (red  crosses)  and  KH  (green  circles) 
where  A  is  the  same  system  matrix  as  in  Fig.  (2b),  and  where  K  is  a  5th -order  PBNO  preconditioner 
with  the  norm  index  p  =  2.  (b)  As  in  panel  (a)  of  this  figure,  except  for  a  PBNO  preconditioner  using 
a  norm  index  p  =  20. 


p  Norm 

ko 

ki 

/C2 

^'3 

&4 

&5 

P  =  2 

3.8319 

-8.0515 

10.161 

-7.4160 

2.8788 

-0.45999 

-e 

II 

to 

o 

5.1638 

-13.367 

19.799 

-16.502 

7.1857 

-1.2669 

Table  1  Coefficient  values  in  Eq.  (8)  for  the  5  -order  PBNO  preconditioners  that  generated  the 
spectra  shown  in  Fig  (3a-b). 


spectral  hull  of  KH  (green  circles)  lies  virtually  on  top  of  the  spectral  hull  of  KA  (red 
crosses).  This  observation  supports  the  assumption  that  tractable  optimization  problem 
(23)  is  a  suitable  proxy  for  the  intractable  problem  (16). 

By  comparing  Fig.  3(a)  with  3(b)  we  can  readily  see  that  increasing  the  norm  index 
from  p  =  2  and  p  =  20  in  optimization  problem  (23)  results  in  a  more  symmetric 
distribution  of  eigenvalues  around  (1,0).  This  is  to  be  expected  since  increasing  the  p- 
norm  index  effectively  results  in  a  heavier  weighting  of  the  least  squares  residual  vector 
components  associated  with  the  eigenvalues  farthest  from  (1,0).  We  will  see  in  Section 
5  that  the  ability  to  change  norm  index  p  will  allow  us  to  create  the  optimal  PBNO 
preconditioner  for  each  of  the  GMRES,  BICGS,  and  RICH  iterative  solvers  that  we  will 
employ. 


4.3  PBNO  Preconditioner  Construction  and  Application 

For  preconditioned  system  matrix  spectra  (red  crosses)  shown  in  Fig.  3a-b,  the  reso¬ 
lution  of  the  RTB  problem  was  sufficiently  coarse  so  that  we  could  actually  construct 
matrix  KA  and  extract  its  entire  spectrum  via  the  QR  method  for  comparison  with  the 
spectrum  of  the  preconditioned  Hessenberg  matrix  KH.  For  higher  resolution  test  case 
problems,  and  for  any  operational  atmospheric  prediction  problem,  we  must  begin  with 
Eq.  (2)  in  the  more  numerically  relevant  form 

La(x)  =  b,  (30) 

where  La{  )  is  the  matrix- free  linear  operator  that  accomplishes  the  transformational 
action  of  the  unsealed  system  matrix  A.  In  this  case  the  method  for  constructing  and 
applying  the  PBNO  preconditioner  is  as  follows: 
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1.  Prior  to  the  time  integration  phase  of  the  numerical  model,  construct  an  unsealed 
Hessenberg  matrix  H  via  an  operator-based  implementation  of  the  Arnoldi  algo¬ 
rithm,  namely: 

La(Q)  =  QH.  (31) 

2.  Apply  the  QR  method  to  H  to  obtain  A  max  and  A  max  in  order  to  construct  A  mid  in 
Eq.  (4),  which  in  turn  enables  us  to  create  a  scaled  system  matrix  operator 

LA(  )  =  ),  (32) 

Amid 

for  use  during  preconditioner  construction  and  at  each  step  of  the  time  integration 
phase  of  the  DG  model. 

3.  Employ  the  method  outlined  in  Sections  4.1  and  4.2  to  determine  the  coefficients  of 
the  PBNO  preconditioner  for  the  values  of  polynomial  order  m  and  norm  index  p 
specified  by  the  user.  We  note  here  that  Eq.  (20)  is  replaced  by  the  operator-based 
form 

LA{Q)  =  QH,  (33) 

and  also  emphasize  that  the  only  quantities  that  require  storage  for  later  use  in  the 
time-integration  phase  of  the  numerical  model  are  the  small  number  (<  10)  of  scalar 
coefficients  that  constitute  the  components  of  the  coefficient  vector  copt. 

4.  After  the  preconditioner  is  constructed  via  Steps  1-3  above,  it  is  then  employed  at 
each  step  in  the  time-integration  phase  to  solve  the  operator-based  form  of  Eq.  (7) 

Lk  o  La(x)  =  LK(b),  (34) 

via  the  particular  iterative  solver  specified  by  the  user.  Here  it  is  important  to  em¬ 
phasize  that  the  operator  L x  is  implemented  iteratively  using  Horner’s  Method,  and 
thus  never  requires  more  processor  memory  than  the  modest  amount6  required  by 
La,  regardless  of  the  polynomial  order  m  of  the  PBNO  preconditioner  K. 

We  emphasize  that  for  Steps  1-3  listed  above,  the  greatest  computational  cost  by  far 
is  incurred  in  Step  1  when  the  matrix  H  is  constructed  via  the  Arnoldi  algorithm,  partic¬ 
ularly  if  system  matrix  A  is  large.  Since  we  use  a  150  x  150  Hessenberg  matrix,  this  cost 
is  approximately  equivalent  to  running  GMRES  for  150  iterations.  Similar  costs  would 
also  be  incurred  when  constructing  other  polynomial  preconditioners  ( e.g .  Chebyshev) 
using  linear  least-squares  methods,  as  well  as  the  cost  to  compute  the  spectrum  of  H 
in  Step  2.  The  cost  to  complete  Step  3,  which  is  the  only  step  unique  to  the  NLLS- 
based  PBNO  preconditioner,  is  negligible  compared  to  the  cost  of  Step  1.  Moreover, 
as  we  show  in  Section  5.4,  the  total  cost  of  constructing  the  PBNO  preconditioner  is 
completely  amortized  very  early  in  just  a  single  run  of  the  DG  model  provided  that  a 
realistically  large  Courant  Number  is  utilized. 

Fig.  4  shows  the  impact  on  the  spectrum  of  KH  due  to  a  set  of  PBNO  preconditioners 
of  increasing  order  constructed  for  a  DG  model  RTB  problem  with  five  times  the  spatial 
resolution  of  the  analogous  coarse-resolution  model  having  the  spectra  shown  in  Fig.  3. 
Notice  the  similarity  of  the  spectra  of  the  unpreconditioned  scaled  Hessenberg  matrix  H 
in  Fig.  4(a)  with  the  analogous  matrix  H  for  the  course-resolution  model  shown  in  Fig. 
2.  Likewise,  note  the  similarity  of  the  spectra  of  the  matrices  KH  (green  circles)  in  Fig. 
3(b)  and  Fig.  4(d),  where,  in  each  case  a  5th-order  PBNO  preconditioner  with  p  =  20 
was  employed.  Finally,  notice  in  Fig.  4(f)  the  high  degree  of  axisymmetry  about  (1,0) 
exhibited  by  the  spectrum  of  H  when  a  9tft--order  PBNO  preconditioner  is  employed. 

6  Recall  that  in  NUMA  neither  the  unsealed  system  matrix  A  nor  the  scaled  system  matrix  A  is  ever 
constructed. 
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a)  b)  c) 


d)  e)  f) 

Fig.  4  From  top  left  to  right  (a)-(c)  and  bottom  left  to  right  (d)-(f).  Spectrum  of  the  scaled  150  X  150 
Hessenberg  matrix  H  derived  from  a  DG  model  of  the  RTB  problem  using  25  X  25  elements,  5  -order 
polynomials  (based  on  LGL  points),  2nd-order-accurate  ARK  time  differencing,  and  a  Courant  No.  of 
16.  The  corresponding  scaled  system  matrix  A  has  dimensions  of  22,  550  X  22,  500. 


4.4  Comparison  GLS  Preconditioners 

Here  we  outline  the  method  set  forth  by  van  Gijzen  [5],  which  constructs  a  GLS  pre¬ 
conditioner  by  numerical  means,  and  has  the  advantage  of  allowing  us  to  choose  any 
positive  weighting  function  w{ A)  we  desire.  The  first  step  is  to  view  the  polynomial 
portion  of  the  integrand  in  optimization  problem  (13)  as  the  function 

/(fc,A)  =  l-As(A),  (35) 

and  then  to  create  the  inner  product-based  cost  function 

F(k)=  J^f(k1X)J{k^)w(X)d\X\.  (36) 

For  an  mth- order  GLS  preconditioner,  the  kopt  that  minimizes  the  cost  function  is  the 
solution  to  the  (m+l)  x  (m+l)  linear  system  arising  from  the  stationary-point  conditions 

dF 

—  =  0  ie  [0, m] .  (37) 

The  reader  is  referred  to  van  Gijzen  [5,  p.  96-97]  for  the  system-matrix  entry  and  right- 
hand  side  vector  component  formulas  resulting  from  Eqs.  (37)  when  preconditioner  K 
is  represented  via  a  power  basis  as  in  Eq.  (8).  Although  the  use  of  a  power  basis  has  the 
potential  for  numerical  instability  as  polynomial  order  m  becomes  large,  we  have  verified 
van  Gijzen’s  claim  that  no  such  problem  occurs  for  lower  order  polynomials.  Specifically, 
when  r  =  [0,4]  and  a  Chebyshev  w{ A)  is  used,  our  numerical  computations  for  m  <  8 
approximate  the  exact  results  presented  by  Saad  [16,  p.  385]  out  to  8  significant  digits. 

Using  the  above  approach  we  have  constructed  GLS  preconditioners  using  both 
uniform  weighting  (hereafter  GLSU)  and  generalized  Chebyshev  weightings  (hereafter 
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GLS  Variant 

ko 

ki 

/C2 

k3 

/C4 

&5 

GLSU 

3.5599 

-7.0695 

8.4318 

-5.9007 

2.2272 

-0.3494 

GLSC 

3.4839 

-7.2185 

9.0164 

-6.4720 

2.4619 

-0.3848 

Table  2  Coefficient  values  in  Eq.  (8)  for  the  5  -order  GLSU  and  GLSC  preconditioners  created  using 
the  octagonal  contour  r  shown  in  Fig.  2. 


GLSC)  applied  to  all  sides  of  the  octagonal  contour  r  shown  in  Fig.  2.  The  coefficients 
of  resulting  kopt  in  each  case  are  shown  in  Table  2  for  comparison  with  the  analogous 
PBNO  preconditioner  coefficients  shown  in  Table  1. 


5  Results  in  a  Serial  Computing  Environment 

In  order  to  adequately  map  out  the  large  parameter  space  associated  with: 

—  the  DG  variant  of  NUMA2D7 

—  a  range  of  possible  time-steps  with  Courant  No.  as  high  as  32, 

—  three  iterative  solvers  (GMRES,  BICGS,  RICH), 

—  preconditioner  order  as  high  as  9t,l-order, 

—  and  the  adjustable  p-norm  index  in  PBNO  optimization  problem  (23), 


we  initially  use  a  coarse  spatial  resolution,  a  low  order  time  integration  scheme,  and  a 
relatively  short  simulation  time  of  100s.  The  model  domain  is  spatially  discretized  using  a 
25-by-25  element  grid  and  5t/l-order  Lagrange  polynomials  (based  on  LGL  points)  within 
each  element,  and  a  2nd-order  accurate  Additive  Runge-Kutta  (ARK2)  time  integration 
scheme  [8].  A  set  of  six  different  time-steps  is  employed  and  corresponds  to  a  Courant 
No.  as  small  as  2  and  as  large  as  32. 8  As  the  analysis  proceeds  we  will  include  selected 
results  that  employ  9t,l-order  spatial  resolution,  higher  order  ARK  methods,  and  a  model 
simulation  time  of  700s  to  show  that  conclusions  based  on  the  coarse  simulation  runs 
are  justified. 


5.1  Solver  Performance  Dependence  on  PBNO  p-Norm  Index 

Recall  that  in  optimization  problem  (23)  the  index  of  the  p-norm  employed  can  be  varied. 
The  effect  of  changing  index  p  on  the  performance  of  the  GMRES,  BICGS,  and  RICH 
iterative  solvers  using  PBNO  preconditioners  of  up  to  9tft--order  is  illustrated  in  Fig.  5  for 
DG-NUMA2D.  Increasing  the  value  of  index  p  from  2  to  10  to  20  moderately  improves 
the  performance  of  GMRES  (Fig.  5(a)-(b))  and  BICGS  (Fig.  5(c)-(d)),  and  significantly 
improves  the  performance  of  RICH  (Fig.  5(e)-(f)).  Based  on  the  results  shown,  a  setting 
of  p  —  10  would  be  appropriate  when  using  either  the  GMRES  or  BICGS  solver,  and 
a  setting  of  p  =  20  would  be  an  appropriate  choice  when  using  the  RICH  solver.  In 
comparing  Figs.  5(b)  and  5(d),  it  is  important  to  note  that  the  wall-clock  time  of  the 
RICH  solver  (with  p  =  20)  is  of  the  same  order  as  both  GMRES  and  BICGS  regardless 
of  PBNO  preconditioner  order  m.  A  summary  of  the  p-norm  index  values  used  in  all 
subsequent  model  runs  are  provided  in  Table  3. 

7  NUMA2D  refers  to  the  two-dimensional  version  of  the  NUMA  model. 

8  A  Courant  No.  of  32  is  the  largest  for  which  the  RTB  problem  will  run  to  completion  (i.e.,  bubble 
at  top  of  domain  at  700s)  without  exceeding  the  CFL  limit  of  the  explicit  part  of  the  IMEX  method. 
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Fig.  5  (a)-(b)  Effect  of  PBNO  preconditioner  p-norm  index  on  GMRES  Iterations/Time-step  and  Wall 
Clock  Time,  respectively,  when  employing  a  time-step  corresponding  to  a  Courant  No.  of  16.  (c)-(d) 
As  in  (a)-(b),  except  for  the  BICGS  iterative  solver.  The  missing  data  points  for  PBNO  order  m  =  0 
reflects  the  fact  that  unpreconditioned  BICGS  solver  would  not  reliably  converge  at  every  time-step. 
(e)-(f)  As  in  (a)-(b),  except  for  the  RICH  iterative  solver.  The  missing  data  points  reflect  the  fact  that 
the  RICH  solver  would  not  converge  for  PBNO  order  m  <  2. 


5.2  Comparison  of  PBNO  and  GLS  Preconditioner  Performance 

In  Figure  7  we  provide  comparisons  of  how  effective  3rd-order  and  74ft--order  PBNO  and 
GLS  preconditioners  are  in  improving  the  convergence  rate  of  the  GMRES  (Fig.  7(a)-(b)) 
and  RICH  (Fig.  7(c)-(d))  iterative  solvers  for  one  particular  time-step  in  the  RTB  model. 
It  is  important  to  emphasize  that  because  the  system  matrix  H  remains  unchanged 
throughout  the  entire  model  run,  the  ’’snap-shots”  provided  in  Fig.  7  are  completely 
representative  of  the  relative  performance  of  the  preconditioners  at  all  time-steps  during 
the  time  integration  process.  The  performance  comparison  using  the  GMRES  solver  (Fig. 
7(a)-(b))  clearly  reveals  that  for  both  3rd-order  and  7t/l-order  polynomials: 

—  The  GLSC  preconditioner  makes  essentially  negligible  improvement  over  the  GLSU 
preconditioner,  which  indicates  that,  at  least  for  the  RTB  problem,  the  mathematical 
advantages  of  choosing  a  Chebyshev  weighing  function  do  not  translate  into  any 
meaningful  performance  improvement  over  a  uniform  weighting  function. 

—  The  PBNO  preconditioner  requires  approximately  20%  fewer  iterations  of  GMRES 
to  achieve  a  relative  residual  of  10-6  compared  to  GLSU. 

The  relative  performance  of  the  PBNO,  GLSU,  and  GLSC  preconditioners  using  the 
BICGS  solver  were  found  to  be  essentially  the  same  as  for  GMRES  and  are  not  shown. 
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GMRES 

BIGCS 

RICH 

Norm  index 

p  =  10 

p=  10 

p  =  20 

Table  3  Norm  index  values  used  in  optimization  problem  (23)  when  constructing  PBNO  precondition¬ 
ers  for  the  three  different  iterative  solvers. 


The  performance  comparison  using  the  RICH  solver  (Fig.  7(c)-(d))  clearly  reveals: 

—  The  RICH  solver  will  not  converge  without  preconditioning 

—  The  GLSC  preconditioner  actual  performs  slightly  worse  than  the  GLSU  precondi¬ 
tioner.  This  is  because  although  the  GLSC  produces  a  tighter  spectral  grouping,  it 
fails  to  center  the  grouping  within  the  unit  circle  about  (1,0),  which  is  essential  for 
accelerating  the  RICH  solver. 

—  The  PBNO  preconditioner  requires  approximately  30%  fewer  iterations  of  RICH 
to  achieve  a  relative  residual  of  10-6  compared  to  GLSU.  This  increase  in  RICH 
solver  convergence  rate  illustrates  the  fact  that  by  design  the  PBNO  preconditioner 
attempts  to  center  the  spectrum  of  the  preconditioned  system  matrix  KA  within  the 
unit  circle  about  (1,0)  (recall  Fig.  4e  for  a  7t,l-order  PBNO  preconditioner). 

Since  the  foregoing  results  clearly  indicate  that  the  PBNO  preconditioner  will  signif¬ 
icantly  outperform  the  comparison  GLSU  and  GLSC  preconditioners  at  all  time-steps, 
no  further  reference  to  the  GLS  preconditioners  will  be  made,  and  all  the  results  that 
follow  focus  on  analyzing  the  properties  of  the  PBNO  preconditioner  with  regard  to  the 
choice  of  iterative  solver,  time  integrator  order,  model  initial  conditions,  etc. 


5.3  Relative  Performance  of  PBNO-preconditioned  Solvers 

The  effect  of  the  PBNO  preconditioners  on  the  iterations/time-step  and  wall  clock  time 
of  GMRES,  BICGS,  and  RICH  solvers  as  a  function  of  RTB  problem  Courant  No.  are 
shown  in  Fig.  7.  Comparing  Figs.  7(b), (d)  and  (f)  reveals  that  in  terms  of  minimum 
wall  clock  time  all  three  solvers  run  more  efficiently  when  higher  order  preconditioning 
is  combined  with  large  Courant  No.  In  particular: 

—  when  employing  GMRES  the  model  runs  the  fastest  when  a  9tft--order  preconditioner 
is  combined  with  the  maximum  Courant  No.  of  32. 

—  when  employing  BICGS  the  model  runs  the  fastest  when  a  7ift--order  preconditioner 
is  combined  with  the  maximum  Courant  No.  of  32. 

—  when  employing  RICH  the  model  runs  the  fastest  when  a  9t,l-order  preconditioner 
is  combined  with  the  maximum  Courant  No.  of  32. 

When  looking  over  Figs.  7(b),  (d)  and  (f)  it  is  clear  that  the  lowest  wall  clock  time 
tends  to  occur  when  using  the  largest  Courant  No.  of  32  (and  thus  longest  time-step). 
However,  the  preconditioner  order  needed  to  achieve  the  lowest  wall  clock  time  when 
running  the  model  at  Courant  No.  32  varies  with  the  iterative  solver  used.  Figure  8 
provides  an  example  of  the  PBNO-preconditioned  model  performance  when  running  at 
the  maximum  Courant  No.  of  32  and  employing  an  ARK4  [10]  time  integrator.  With 
regard  to  iterations  per  time-step  (Fig.  8(a)),  important  points  to  note  are  that: 

—  The  performance  of  the  RICH  solver  is  only  slightly  worse  than  the  GMRES  and 
BICGS  solvers  for  low  PBNO  order,  but  nearly  matches  the  performance  of  GMRES 
for  PBNO  order  greater  than  4. 
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Fig.  6  (a)  A  representative  example  of  the  relative  residual  convergence  rates  exhibited  by  the  GMRES 
iterative  solver  at  each  time-step  of  the  RTB  problem  using  the  3rd-order  preconditioners  shown  in  the 
legend  and  the  following  model  specifications:  25-by-25  element  grid  and  5  -order  Lagrange  polynomials 
(based  on  LGL  points)  within  each  element,  an  ARK2  time  integrator  and  a  Courant  No.  of  16.  (b)  As 
in  (a),  except  for  7th -order  preconditioners,  (c)-(d)  As  in  (a)  and  (b)  respectively,  except  for  using  the 
RICH  iterative  solver. 


—  Although  BICGS  has  significantly  fewer  iterations  per  time-step  than  GMRES,  it 
must  be  remembered  that  this  advantage  tends  to  be  offset,  to  some  degree,  by  the 
fact  that  BICGS  applies  the  preconditioned  system  matrix  twice  at  each  time-step. 

With  regard  to  wall  clock  time  (Fig.  8(b)),  important  points  to  note  are  that: 

—  The  lst-order  preconditioning  of  GMRES  or  BICGS  results  in  a  dramatic  reduction 
in  wall  clock  time  compared  to  unpreconditioned  GMRES  (the  only  solver  for  which 
DG-NUMA2D  would  run  without  preconditioning). 

—  The  performance  of  all  three  solvers  is  very  similar  for  PBNO  order  m  >  2  and  we 
can  see  that  a  modest  reduction  in  wall  clock  time  occurs  as  PBNO  order  increases. 

—  The  competitiveness  of  the  RICH  algorithm  for  PBNO  order  m  >  2  is  particularly 
noteworthy  given  that  the  algorithm  is  run  in  a  dot-product-free  mode  in  NUMA2D. 


5.4  PBNO  Preconditioner  Construction  Cost  Amortization 

Recall  that  Premise  3  formalizes  the  assumption  that  the  same  preconditioner  can  be 
used  many  times,  and  that  one  might  expect  that  this  will  amortize  the  cost  of  con- 
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Fig.  7  (a)-(b)  Iterations/time-step  (a)  and  wall  clock  time  (b)  as  a  function  of  Courant  No.  exhibited 
by  DG-NUMA2D  preconditioned  with  the  orders  of  the  PBNO-preconditioner  shown  in  the  legend. 
For  these  results  DG-NUMA2D  employed  5^-order  Lagrange  polynomials  (based  on  LGL  points),  an 
ARK2  time  integrator,  and  RTB  simulation  time  of  100s.  (c)-(d)  As  in  (a)  and  (b),  except  for  using 
the  BICGS  solver.  The  missing  data  points  for  0-order  preconditioning  indicates  that  the  model  would 
not  run  to  completion  when  employing  the  unpreconditioned  BICGS  solver,  (e)-(f)  As  in  (a)  and  (b), 
except  for  using  the  RICH  solver. 


Fig.  8  Comparison  of  iterations/time-step  (a)  and  wall  clock  time  (b)  as  a  function  of  PBNO  pre¬ 
conditioner  order  for  DG-NUMA2D.  Missing  data  points  indicate  that  the  model  would  not  run  to 
completion  using  that  particular  combination  of  iterative  solver  and  PBNO  order.  The  model  employed 
5^-order  Lagrange  polynomials  (based  on  LGL  points),  a  Courant  No.  of  32,  an  ARK4  time  integrator, 
and  RTB  simulation  time  of  100s. 
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Fig.  9  (a)-(d)  Graph  of  wall-clock  time  versus  model  simulation  time  when  running  the  RTB  problem 
using  unpreconditioned  GMRES  (red-circles)  and  GMRES  using  a  9t,l-order  PBNO  preconditioner 
(green  squares)  for  the  various  combinations  of  Courant  No.  and  ARK  time  integrator  accuracy  shown 
in  the  upper  left  of  each  panel.  The  horizontal  distance  between  adjacent  circles  or  squares  indicates 
the  length  of  the  time-step  employed.  In  all  the  panels  the  model  domain  is  spatially  discretized  using  a 
25-by-25  element  grid  and  5t,l-order  Lagrange  polynomials  (based  on  LGL  points)  within  each  element. 


structing  the  preconditioner  over  a  great  number  of  repeated  applications.  The  alge¬ 
braic  justification  for  this  premise  is  that,  within  a  suitably  formulated  IMEX  modeling 
context,  the  linear  operator  represented  by  matrix  A  in  Eq.  (7)  does  not  vary  from  time- 
step  to  time-step,  and  thus  matrix  K  (i.e.,  the  PBNO  preconditioner)  does  not  need  to 
change.  Verification  of  the  validity  of  Premise  3  is  provided  in  Figs.  9(a)-(d)  which  plots 
wall-clock  time  as  a  function  of  model  simulation  time  for  runs  of  NUMA2D-DG  that 
employ  both  unpreconditioned  GMRES  (red  circles)  and  GMRES  preconditioned  with 
a  9tft--order9  PBNO  preconditioner  (green  squares). 

First,  notice  that  all  the  graphs  are  nearly  straight  lines,  which  reflects  the  fact 
that  matrix  A  in  Eq.  (7)  does  not  change.  The  slight  departures  from  linearity  are  a 
manifestation  of  the  fact  that  the  average  number  of  GMRES  iterations  per  time-step 
has  a  slight  dependence  on  vector  b  in  Eq.  (7),  which  changes  from  time-step  to  time-step 
due  to  its  dependence  on  the  evolving  RTB  state  vector  (recall  Eqs.  (1)  and  (2)). 

Notice  that  in  each  panel  of  Fig.  9  the  preconditioned  GMRES  graph  starts  with 
a  wall-clock  time  of  approximately  4  seconds,  which  reflects  the  upfront  cost  of  con¬ 
structing  the  PBNO  preconditioner  before  RTB  simulation  begins.  However,  the  key 
observation  is  that  in  each  panel  the  preconditioned  GMRES  wall-clock  time  graph 
crosses  the  unpreconditioned  GMRES  wall-clock  time  graph  within  the  first  few  seconds 
of  model  simulation  time.  At  the  cross-over  point  the  cost  of  constructing  the  PBNO 
preconditioner  has  been  completely  amortized,  which  means  that  after  the  cross-over 
point  the  preconditioned  model  is  requiring  less  wall-clock  time  than  the  unprecondi- 

9  By  making  the  polynomial  order  the  largest  we  consider  in  this  paper  (i.e.,  9  -order)  we  maximize 
the  number  of  Gauss-Newton  iterations  required  during  the  construction  of  the  preconditioner,  and  thus 
maximize  the  construction  cost. 
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tioned  model.  For  the  2"d-order-accurate  ARK  time  integrator,  when  we  use  a  time-step 
corresponding  to  a  Courant  No.  of  8,  the  cross-over  point  occurs  after  34  time-steps  (Fig. 
9(a));  when  we  increase  the  Courant  No.  to  32  the  cross-over  point  drops  dramatically 
to  just  after  the  1st  time-step  (Fig.  9(b)).  For  the  4tft-order- accurate  ARK  time  inte¬ 
grator,  when  using  a  time-step  corresponding  to  a  Courant  No.  of  8  the  cross-over  point 
occurs  after  only  14  time-steps  (Fig.  9(c));  increasing  the  Courant  No.  to  32  brings  the 
cross-over  point  to  before  the  completion  of  the  1st  time-step  (Fig.  9(d)).  The  increasing 
rapidity  with  which  the  construction  cost  of  the  PBNO  preconditioner  is  amortized  as 
ARK  order  and  Courant  No.  are  increased  reflects  the  fact  that  the  construction  cost  of 
the  preconditioner  is  roughly  equivalent  to  150  GMRES  iterations  (as  determined  by  the 
size  of  H )  regardless  of  the  Courant  No.  or  ARK  order  used.  By  contrast,  the  average 
number  of  GMRES  iterations  per  time-step  increases  as  either  the  order  of  the  ARK 
method  or  the  size  of  the  Courant  No.  are  increased. 

The  results  in  this  subsection  clearly  show  that  Premise  3  is  easily  satisfied  even 
if  the  PBNO  preconditioner  is  applied  to  only  one  run  of  NUMA2D-DG.  In  the  next 
subsection,  we  show  that,  once  constructed,  the  same  PBNO  preconditioner  can  be  used 
for  multiple  runs  of  the  model  with  different  initial  conditions,  thus  making  the  cost  of 
constructing  the  preconditioner  even  more  inconsequential. 


5.5  Long-Duration,  High-Resolution  Simulation  Comparisons 

As  mentioned  earlier,  to  expeditiously  map  out  the  large  parameter  space  associated 
with  multiple  PBNO  variants,  iterative  solvers,  etc ,  we  initially  presented  results  based 
on  coarse  spatial  resolution,  a  low-order  time  integration  scheme,  and  relatively  short 
simulation  time  of  100s.  To  ensure  that  PBNO  preconditioning  up  to  order  m  =  9 
produces  stable  and  consistent  results  for  longer  RTB  simulation  times,  regardless  of  the 
iterative  solver  used,  we  ran  the  problem  to  700s,  by  which  time  the  bubble  has  already 
impacted  the  top  of  the  domain.  For  these  runs  we  employ  a  problem  domain  that  is 
spatially  discretized  using  a  25-by-25  element  grid  and  9tft-order  Lagrange  polynomials 
(based  on  LGL  points)  within  each  element,  and  we  use  a  4tft--order  accurate  Additive 
Runge-Kutta  (ARK4)  time  integration  scheme.10 

For  these  runs  we  also  used  two  different  initial  conditions: 

—  the  continuous  cosine-based  profile  shown  in  Fig.  10(a)  on  which  all  runs  shown 
earlier  in  this  paper  are  based, 

—  and  a  discontinuous  profile  shown  Fig.  10(b)  to  show  that  PBNO  preconditioned 
NUMA2D-DG  can  effectively  handle  strong  state  variable  gradients,  as  well  as  show 
that  changing  the  initial  condition  does  not  require  recomputation  of  the  PBNO 
preconditioner  coefficients. 

For  the  purpose  of  comparison,  Figures  10(c)  and  (d)  show  the  RTB  potential  tem¬ 
perature  fields  for  the  continuous  and  discontinuous  cases,  respectively,  at  100  seconds 
of  model  simulation  time,  which  is  well  before  significant  deformation  of  the  bubble 
occurs  in  either  simulation.  In  the  discontinuous  case  (Fig.  10(d)),  one  can  see  that  the 
inclusion  of  an  artificial  kinematic  viscosity  of  1.0  m2 / s  has  smoothed  the  initial  discon¬ 
tinuity  into  a  strong  gradient.11  A  smaller  artificial  kinematic  viscosity  of  0.01m2/s  is 

10  The  total  number  of  grid  points  in  this  simulation  is  50,625  with  each  grid  point  having  4  variables 
for  a  total  of  202,500  degrees  of  freedom.  Using  a  time-step  of  dt=0. 148287  means  that  the  linear  system 
has  to  be  solved  71  jj*'”  =  18,  882  times  during  the  700s  simulation,  where  k=4  denotes  the  number  of 
implicit  Runge-Kutta  stages  used  in  the  ARK  method. 

11  We  emphasize  here  that  the  use  of  artificial  viscosity  is  not  necessary  for  NUMA2D-DG  to  run  the 
discontinuous  RTB  case  to  completion,  and  has  no  effect  on  the  performance  of  the  PBNO  precondi- 


20 


L.E.  Carr,  C.F.  Borges,  and  F.X.  Giraldo 


Initial  Condition  Cross-section 


Q — 
CD 

a3 

Q_ 

E 


o 

CO 

Q 


■c 

(D 

Q_ 


Initial  Condition  Cross-section 


Horizontal  Position  (m) 


a) 


b) 


c)  d) 

Fig.  10  (a)-(b)  RTB  cross-sections  for  the  continuous  cosine-based  and  discontinuous  step- function- 
based  initial  conditions  respectively,  (c)-(d)  Potential  temperature  fields  at  100s  corresponding  to  the 
initial  conditions  shown  in  panels  (a)  and  (b),  respectively. 


also  used  in  the  continuous  case,  but  only  for  the  purpose  of  making  the  cost  of  running 
NUMA2D-DG  comparable  for  the  two  initial  conditions. 

Potential  temperature  fields  that  result  from  running  the  model  using  the  contin¬ 
uous  initial  condition  (Fig.  10(a))  and  employing  unpreconditioned  GMRES,  9t,l-order 
PBNO-preconditioned  GMRES,  BICGS,  and  RICH  are  depicted  in  Fig.  ll(a)-(d),  re¬ 
spectively.  The  fields  shown  are  at  the  600s  point  in  the  700s  simulation,  which  is  ap¬ 
proximately  the  time  when  the  bubble  attains  its  maximum  upward  speed.  It  is  visually 
evident  that  all  four  runs  closely  reproduce  the  fine  structure  of  the  rising  bubble.  The 
small  values  of  \AT\max  shown  in  Fig.  ll(b)-(d)  provide  numerical  confirmation  that 
94/t-order  preconditioning  of  the  GMRES,  BICGS,  and  RICH  solvers  is  not  causing  any 
significant  dispersion  or  phase  lag  relative  to  the  unpreconditioned  GMRES  field  (Fig. 
11(a)).  A  comparison  of  wall  clock  time  values  in  the  lower  right  of  each  panel  shows 
that  relative  to  unpreconditioned  GMRES  (which  is  the  only  solver  that  runs  without 
preconditioning  in  DG  NUMA2D): 

tioner.  Rather  the  numerical  viscosity  has  been  added  solely  for  the  purpose  of  avoiding  the  generation 
of  spurious  eddies  associated  with  imposing  a  curving  discontinuous  initial  condition  on  a  rectangularly 
discretized  model  domain,  and  thus  allowing  the  continuous  and  discontinuous  runs  to  produce  visually 
comparable  results. 
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a) 


b) 


c)  d) 

Fig.  11  (a)-(d)  Potential  temperature  fields  for  the  continuous  RTB  test  case  at  600s  in  a  700s  simu¬ 
lation  time  run  of  the  model.  Model  setup  specifications  are  as  described  in  the  text.  The  heading  on 
each  panel  identifies  the  iterative  solver  and  order  of  PBNO  preconditioning  employed.  Iterations  per 
time-step  and  wall  clock  time  in  seconds  appear  in  the  upper  and  lower  right  of  each  panel,  respectively. 
The  value  in  the  upper  left  of  panel  (a)  is  the  maximum  potential  temperature  in  the  bubble.  The  value 
in  the  upper  left  of  panels  (b)-(c)  is  the  infinity  norm  of  the  difference  between  the  unpreconditioned 
GMRES  potential  temperature  state  vector  in  (a)  and  the  respective  potential  temperature  state  vector 
in  panels  (b),  (c),  and  (d). 


—  preconditioned  GMRES  is  running  «  2.78  times  faster, 

—  preconditioned  BICGS  is  running  ss  2.40  times  faster, 

—  and  preconditioned  dot-product-free  RICH  is  running  ss  2.16  times  faster. 

A  comparison  of  average  iterations  per  time-step  values  in  the  upper  right  of  each 
panel  of  Fig.  11  shows  that  relative  to  unpreconditioned  GMRES: 


—  preconditioned  GMRES  uses  «  8.83  times  fewer  iterations, 

—  preconditioned  BICGS  uses  ss  14.7  times  fewer  iterations, 

—  and  preconditioned  RICH  uses  ss  6.37  times  fewer  iterations. 

Potential  temperature  Helds  that  result  from  running  the  model  using  the  discontin¬ 
uous  initial  condition  (Fig.  10(b))  and  employing  unpreconditioned  GMRES,  9t,l-order 
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GMRES  GMRES  usinq  PBNO-9 


BICGS  using  PBNO-9 


|AT|  =  4.04e-07  Iters/Step  =  28.64 


Model  Time  (s)  Wall  Time  (s) 

500  41134 


b) 

RICH  usinq  PBNO-9 


|AT|  =  4.44e-07  Iters/Step  =  65 


Model  Time  (s)  Wall  Time  (s) 

500  45153 


c) 


d) 


Fig.  12  (a)-(d)  As  in  Figs.  10(a)-(d),  except  for  the  discontinuous  RTB  test  case  at  500s. 


PBNO-preconditioned  GMRES,  BICGS,  and  RICH  are  depicted  in  Fig.  12(a)-(d),  re¬ 
spectively.  The  fields  shown  are  at  the  500s  point  in  the  700s  simulation,  which  is  ap¬ 
proximately  the  time  when  the  more  energetic  discontinuous  bubble  reaches  the  same 
height  and  state  of  development  as  the  less  energetic  continuous  bubble.  As  in  the  con¬ 
tinuous  RTB  case,  it  is  again  visually  evident  that  all  four  runs  closely  reproduce  the 
fine  structure  of  the  rising  bubble,  as  well  as  maintaining  the  strong  gradient  of  temper¬ 
ature  seen  at  100  s  in  Fig.  10(d).  The  small  values  of  \AT\max  shown  in  Fig.  12(b)-(d) 
provide  numerical  confirmation  that  (l^-order  preconditioning  of  the  GMRES,  BICGS, 
and  RICH  solvers  is  not  causing  any  significant  dispersion  or  phase  lag  relative  to  the 
unpreconditioned  GMRES  field  (Fig.  12(a)). 

As  we  conclude  this  section,  we  again  wish  to  emphasize  that  the  same  PBNO  pre¬ 
conditioner  that  was  constructed  for  the  continuous  bubble  case  was  also  utilized  in 
the  discontinuous  bubble  case,  and  could  have  been  used  with  an  unlimited  number  of 
different  initial  conditions  without  incurring  any  additional  construction  cost. 


6  Conclusion 

We  have  introduced  a  method  for  constructing  a  polynomial  preconditioner  using  a 
nonlinear  least  squares  (NLLS)  algorithm  and  have  shown  that  this  polynomial-based 
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NLLS-optimized  (PBNO)  preconditioner  significantly  outperforms  two  generalized  lin¬ 
ear  least  squares  (GLS)  preconditioners  when  running  a  2-D  discontinuous  Galerkin 
(DG)  compressible  flow  model  of  a  rising  bubble  (RTB)  test  case  using  implicit-explicit 
(IMEX)  time  integrators.  The  DG  model  with  PBNO  preconditioner  achieves  signifi¬ 
cant  reduction  in  GMRES  iteration  counts  and  model  wall-clock  time.  Comparisons  of 
the  ability  of  the  PBNO  preconditioner  to  improve  model  performance  when  employing 
BICGS  and  RICH  have  also  been  included.  In  particular,  we  have  shown  that  higher 
order  PBNO  preconditioning  of  RICH  (which  is  run  in  a  dot  product  free  mode)  makes 
the  algorithm  competitive  with  GMRES  and  BICGS  in  a  serial  computing  environment. 
In  addition,  since  the  method  used  to  construct  the  PBNO  preconditioner  can,  without 
modification,  handle  both  positive  definite  and  complex  spectra,  we  believe  that  the 
PBNO  preconditioning  approach  is,  for  certain  types  of  problems,  an  attractive  alterna¬ 
tive  to  existing  GLS  polynomial  preconditioners  based  on  linear  least-squares  methods. 
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